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Abstract 

This paper studies a sparse signal recovery task in time-varying (time-adaptive) environments. 
The contribution of the paper to sparsity-aware onhne learning is threefold; first, a Generalized 
Thresholding (GT) operator, which relates to both convex and non-convex penalty functions, is 
introduced. This operator embodies, in a unified way, the majority of well-known thresholding rules 
which promote sparsity. Second, a non-convexly constrained, sparsity-promoting, online learning 
scheme, namely the Adaptive Projection-based Generalized Thresholding (APGT), is developed that 
incorporates the GT operator with a computational complexity that scales linearly to the number 
of unknowns. Third, the novel family of partially quasi-nonexpansive mappings is introduced as a 
functional analytic tool for treating the GT operator. By building upon the rich fixed point theory, 
the previous class of mappings helps us, also, to establish a link between the GT operator and a 
union of linear subspaces; a non-convex object which lies at the heart of any sparsity promoting 
technique, batch or online. Based on such a functional analytic framework, a convergence analysis 
of the APGT is provided. Furthermore, extensive experiments suggest that the APGT exhibits 
competitive performance when compared to computationally more demanding alternatives, such as 
the sparsity-promoting Affine Projection Algorithm (APA)- and Recursive Least Squares (RLS)- 
based techniques. 
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1 Introduction 



Sparsity-aware learning has been a topic at the forefront of research over the last ten years or so 
[Tl[2]. Considerable effort has been invested in developing efficient schemes for the recovery of sparse 
signal/parameter vectors. However, most of these efforts have focussed on batch processing, via the 
Compressed Sensing or Sampling (CS) framework. In CS, an iterative algorithm is mobilized to solve 
the estimation task once all measurements (training data) have been collected by the processing unit 
[iHS]. It is only very recently that online (time-adaptive) algorithms have been developed, where the 
training data are processed sequentially, and the sparse signal to be recovered has the freedom to be 
time- varying [6Vll2j. Both CS and online techniques share a common strategy, namely thresholding] i.e, 
a thresholding rule is used to impose sparsity-aware a-priori knowledge: some of the components of the 
signal/vector to be estimated are kept intact, while the rest of them are shrunk under some user-defined 
rule. Two thresholding operators dominate the literature: (i) hard thresholding, a brute force method, 
where shrinking is achieved by setting the size of some of the vector components to zero, and (ii) soft 
thresholding, where the shrinking operation is based on the (weighted) £i-norm of the vector. 

A large number of thresholding operators have been studied thoroughly, both in theoretical and 
experimental contexts, mainly within the statistics community |13H27j . It is by now well-established 
that hard thresholding, a discontinuous operator, has a tendency for larger variance of the estimates. 
Moreover, due to its discontinuity, hard thresholding can lead to instabilities, in the sense of being 
sensitive to small changes in the training data |21j . Soft-thresholding, is a continuous operator, that 
tends to introduce bias in the estimates. Therefore, alternative thresholding rules have been proposed 
in an effort to overcome these drawbacks [13lll8l[T9ll24p 27j . These advances in thresholding operators 
are strongly connected to optimization tasks; they are obtained by minimizing squared error terms 
regularized by, usually, non-convex penalty functions. 

The contribution of this paper is threefold. First, the generalized thresholding (GT) operator is 
introduced, which encompasses classical hard and soft thresholding rules, as well as the recent advances 
of [13H22t[MlI27j . Moreover, the proposed framework, motivated by the rich fixed point theory |28p29j. 
is general enough to provide means for designing novel thresholding rules and/or incorporating a priori 
information associated with the sparsity level, i.e., the number of nonzero components, of the sparse 
vector to be recovered. More importantly, GT is also allowed to non-convexly constrain the unknown 
vector. 

Second, the GT operator is incorporated into a signal/parameter estimation framework. Here, we 
choose the set theoretic estimation approach [3^, and in particular its online version, introduced in (31] 
and extended in |32p33j . In particular, the Adaptive Projection-based Generalized Thresholding (APGT) 
algorithm is proposed having three important merits, a) It is an online algorithm, b) it promotes sparse 
solutions effectively via the flexibility provided by the GT operator and c) its computational complexity 
scales linearly to the number of unknowns. With respect to performance, although APGT shows a low 
computational load, the experimental validation of Section [5] demonstrates that it exhibits a competitive 
performance even when compared to very recently developed, sparsity-promoting, and computationally 
more demanding alternatives, such as the APA- and RLS-based techniques [7t [34ti36] . 

It should be noted that the adopted set theoretic estimation framework was also utilized in 
where sparsity was induced via £i -based constraints, well-known to be convex and intimately connected 
to soft thresholding operations. In contrast, the fact that the GT operator is a "non-convex" mapping 
poses certain challenges for the convergence analysis of the algorithm. Specifically, the existing theory 
|3H - l33j which, so far, has been developed around convex sets and constraints is not rich enough to cover 
the APGT case. In order to theoretically support the incorporation of GT into learning mechanisms, 
such as the APGT, a novel family of operators, hereafter referred to as partially quasi-nonexpansive 
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mappings, is introduced, to the best of our knowledge, for the first time. It is the introduction of 
the partially quasi- nonexpansive mappings and their nice properties, which allowed the convergence 
analysis of APGT to be developed. These operators serve as a sound theoretical tool which allows 
the use of variational analysis [37] and fixed point theory |281l29j to attack non-convexly constrained 
learning problems. It is shown that GT belongs to this class of nonlinear mappings, with its fixed point 
set being a union of subspaces; a non-convex object which lies at the heart of any sparsity-promoting 
technique [381139]. 

It should be stressed that, propelled by such a generic operator theoretical framework, the pro- 
posed GT mapping offers a sound mathematical basis for infusing sparsity arguments into both batch 
(CS) and online approaches, beyond the set-theoretic framework adopted here. Moreover, the present 
manuscript shows a value beyond sparsity-aware learning. Through the novel concept of the partially 
quasi-nonexpansive mappings, this study stands also as the first step toward the extension of [311 - 133] to 
non-convexly constrained online learning tasks. 

The remainder of the paper is organized as follows. The problem under consideration is stated in 
Section [2j In Section [3l the GT operator is introduced. The proposed APGT algorithm is given in Sec- 
tion HI together with its properties and the definition of the novel family of partially quasi-nonexpansive 
mappings. Section [5] contains the experimental validation of APGT. A number of appendices support 
theoretically the developments exposed throughout the paper. More specifically, in App.[C]the proper- 
ties of the generalized thresholding operator are studied rigorously, and the convergence analysis of the 
proposed algorithm is performed in App[Dj A preliminary version of this study was presented in [40] . 

2 Problem Statement and Related Work 

We will denote the set of all non-negative integers, positive integers, and real numbers by N, N*, and 
M, respectively. Given any integers ji, j2, such that ji < j2, let ji, j2 := + 1, . . . ,^2}- 

The stage for discussion will be the Euclidean space M.^ , where L € . Given any pair of vectors 
ai,a2 € M^, the inner product in is defined as the classical vector-dot product (ai,ai) := aja2, 
where T stands for vector /matrix transposition. The induced norm is ||-|[ := \/{^^- 

Our discussion will revolve around the following celebrated linear model: 

Vn = u^a^ + Vn, Vn G N, (1) 

where a^, € is an unknown vector/signal, {Un,yn)n£N C x M is a sequence of known training data, 
and (fn)n6N stands for the noise process. In other words, the unknown a^, is "sensed" by a sequence of 
input vectors {un)neN, via the inner product of R^, in order to produce the noisy outputs (yn)neN- The 
vector a* is considered to be sparse, i.e., most of its components are zero. If we define ||a*||o to stand 
for the number of non-zero components of a*, then the assumption that a^, is sparse can be equivalently 
given by := ||a*||Q <^ L, and the vector a^, will be called K-t-sparse. 

This study attacks the following inverse problem: estimate the unknown sparse vector a* by utilizing 
the sequence of training data [u^, yn)n&i- A family of algorithms which shares a similar objective is the 
Compressed Sensing or Sampling (CS) framework [HE]- Given a fixed number S N^, of training data 
(ttj, a CS algorithm is mobilized in order to compute an estimate a„ of a*. CS belongs 
to the class of batch algorithms, i.e., in the case where the datum {un+i,yn+i) enters the system, a 
CS algorithm starts from scratch, and triggers a generally time consuming iterative procedure which 
operates on the data (uj, yj)"^„^_^^2 computing the updated estimate a.„+i of a^. In contrast to 
hatch learning approaches, this manuscript focuses on sparsity-aware online learning, i.e., an algorithmic 
framework which satisfies the following requirements. 
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1. The estimates of a^, should be updated in a simple and efficient way every time that a new datum 
Un) enters the system. The need to mobilize an optimization procedure from scratch, for every new 

datum {un,yn), as in CS, should be avoided. 

2. The operations needed in order to update the estimate should be of low computational complexity; 
hopefully of linear complexity with respect to the number of unknowns, i.e., 0{L). 

3. The unknown has also the freedom to be time-varying. Thus, an online learning scheme should 
be also able to quickly track any variations of a^. 

The mainstream of sparsity-promoting online methods exploits training data {Un,yn)neN in the 
context of classical adaptive filtering [51]; a quadratic objective function is used to quantify the designer's 
perception of loss. Additionally, a convex differentiable function is regularized by a sparsity promoting 
term, usually one that builds around the ii norm penalty function, and a minimizer of the resulting 
optimization task is sought either in the RLS or the LMS rationale, e.g., [BHS]. Another sparsity- 
promoting methodology, where different components of the vector estimates are weighted under several 
user-defined rules, is given by proportionate-type schemes |34H36] . Very recently, a novel online method 
for the recovery of sparse signals, based on set theoretic estimation arguments |301I42|. was developed 
in [11], and extended for distributed learning in |43| . 

The set theoretic estimation philosophy departs from the standard approach of constructing a loss 
function first; instead, it initially identifies a set of solutions which are in agreement with the available 
measurements as well as the available a-priori knowledge. A popular strategy is to define, at each time 
instance n G N, a closed convex subset of M^, by means of the training data pair (tt„,y„), to contain 
the unknown a^, with high probability. Different alternatives exist on how to "construct" such convex 
regions. A popular choice takes the form of a hyperslab around {un,yn), which is defined as: 

Sn[en] ■■= {a G : {u^a - y„| < e„}, V?i G N, (2) 

for some user-defined tolerance e„ > 0, and for ti„ ^ 0. The parameter e„ determines, essentially, 
the width of the hyperslabs, and it implicitly models the effects of the noise, as well as various other 
uncertainties, like measurement inaccuracies, calibration errors, etc. For example, if the noise were 
bounded, i.e., 3p > such that < p, Vn G N, then for any choice of > p it is easy to verify 
that a* G Sn[€n], Vn G N. A rigorous stochastic analysis in the case of bounded noise, where almost 
sure convergence of the sequence of estimates is proved for a special member of the rich family of the 
Adaptive Projected Subgradient Method (APSM) [3TH33] . can be found in [H]. In the case of unbounded 
noise, the well-known Tchebichev inequality |45j suggests that for any 

> 0, 

Probja, G Sn[en]] = Prob{|uIa, - y„| < e„} > 1 - ^' " , ^ = 1 - 

where Prob denotes probability, and E stands for the expectation operator. In other words, e„ defines 
also a measure of confidence in having the unknown a* in the hyperslabs ([2]). 

The (metric) projection mapping Pg^^^^T^ |29j onto the hyperslab /^^[en] ([2|) is given by the following 
simple analytic formula: 

if yn - en > ula, 

ii \u^^a - yn\ < e^, (3) 
if yn + en < ula. 
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In [TT] sparsity was induced within the convex analytic framework, and particularly via projections 
onto convex £i-balls. Here, the fixed point theoretical framework |28p29j is used in order to generalize 
the set theoretic estimation approach to support sparsity promoting constraints, which do not lie under 
the umbrella of convexity. This is realized via a novel operator theoretic framework, which embraces 
a wide range of thresholding rules referred to as Generalized Thresholding (GT) operators, described 
next. 

3 The Generalized Thresholding (GT) Mapping 

A couple of definitions are necessary prior to introducing GT mapping. 

Definition 1 (The ordered tuple notation). Given K & define the set of all ascending tuples of 
length K as ^{K,L) := {{li,l2, ■ ■ ■ ,1k) '■ 1 < h < h < ■ ■ ■ < 1^ L}. Clearly, the cardinality of 
^{K,L) is (^). An example of such an ordered tuple is the support of a vector x G M^, defined by 
supp(a;) := {I £ 1, L : xi ^ 0) £ ^{\ supp(a;)|, L), where | ■ | stands for the cardinality of a set. 

Definition 2 (Subspace associated to a tuple). Given J G ^{K, L), let Mj := {a G : ai = 0,\/l ^ j|. 
Clearly, Mj is a linear subspace of M^. Moreover, notice that if Ji C J2, then Mj^ C Mj^. In particular, 
if supp(a;*) C J, then jc* G Mj. An illustration of Mj can be found in Fig. [TJ 

Motivated by the hard thresholding operator, let us introduce here the main object of this study. 

Definition 3 (The mapping T^r^^). Fix a positive integer KG 1, L — 1 and define TQnj) : — > as 
follows. For any x G M^, the output z := T^^{x) is obtained according to the following steps: 

1. Compute, first, the tuple Jx^^ G J^{K,L) which contains the indices of the K largest, in absolute 
value, components of x. To avoid any ambiguity, in the case where we identify more than one component 
of X with the same absolute value, we always choose the one with smallest index. 

2. Define ci^^ := min||rE;| : / G J^^^j. In words, is the smallest among the K largest absolute 
values of the components of x. Clearly, V/ ^ \xi\ < 

3. Compute the components of z as: zi := xi, if I G and zi := shr(x;), if / ^ where the 
function shr : T>x — >■ M, with Vx '■= [—Cx^\Cx^'^]j satisfies the following properties: 

4. rshr(r) > 0, Vr G P^;- 

5. |shr(r)| < |r|, Vr G Vx- 

6. Going a step further than the previous property, we assume also that given any sufficiently small 
e > 0, there exists a 6 > 0, such that for any x G M'^, and Vr G T^x \ (— e,e), |shr(r)| < |r| — 6. In 
other words, 6 could be a user-defined parameter which guarantees that the function shr acts as a strict 
shrinkage operator for all the components of x with indexes not in J^^^ . The e parameter is introduced 
in order to exclude from the picture, since at this point the shr function usually takes the value of 0, 
i.e., shr(O) = (see Fig. [2D. 

Put in other words, the GT mapping operates as follows; given the input vector x, a number of K 
components of x, i.e., those with the K largest absolute values, are kept intact, while the rest of them 
are shrunk according to the shr function. See, for example. Fig. [H 
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Figure 1: An illustration of Tq^\ for the case of a 3-dimensional space, i.e., L := 3, and K := 2. Take 
for example the point ai. The K = 2 largest, in magnitude, coordinates of ai are the first two ones, i.e., 
1,2). The linear subspace 2) stands for all those vectors in M'^ where all the components. 



J. 



except from those in the positions (1,2), are equal to 0. The first two components of ai stay unaffected 
by , while the third one is shrinked by the function shr. If this third coordinate is set to 0, then 
Tq^^ acts as the hard-thresholding mapping On the other hand, the point 02 is already located 

in M(2,3), i.e., its first coordinate is 0. Hence, the application of to a2 has no effect, and a2 stays 
fixed to its original position. 
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Figure 2: Plots of Penalized Least-Squares Thresholding Operators (PLSTO) for various choices of the 
penalty function p m ([5]). 



The function shr is user-defined and it can get various forms as long as it complies with the properties 
described before. As an example, a thresholding operator in the GT family based on an arbitrary shr 
function is shown in Fig. [2^. Note that it comprises both discontinuities and nonlinear regions. A more 
systematic way to built GT's is via the univariate Penalized Least Squares optimization task; given 
a G M, 

mm- {a - af + Xp{\a\), (4) 

a Z 

where p{-) is nonnegative, nondecreasing and differentiable function on (0,oo). This problem is at the 
heart of many batch sparsity promoting algorithms as it is discussed in App. [Bl It turns out, that 
@ has, in general, a unique solution which is obtained when a is properly thresholded/shrinked |21j . 
Accordingly, let us define the Penalized Least-Squares Thresholding Operator (PLSTO) as the mapping 
which maps a given d to the previous unique minimizer: 

^PLSTO ■ « ^ argmin^gK ^ (a - af +p{\a\). (5) 
In simple words, the PLSTO of ([5]) shrinks, in some sense that is dictated by p, the size of a. Examples 
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of PLSTO's exhibiting different characteristics are shown in Fig. [2|^b-d) and details together with the 
corresponding Hterature review can be found in App. [Bj AU the thresholding rules of Fig. E^b-d) satisfy 
the properties of Def. I3I4I and Def. 13151 Moreover, they also satisfy the property of Def. I3I6I in their 
respective strict-shrinkage region, i.e., in the case where the lies in the domain of all those r € M 
such that |rpLSTo(''") I ^ I''"!- Notice, also, that we do not impose any regularity conditions on shr, 
like continuity or differentiability, unlike most of the known PLSTO do [13l[19l|23l[27]. As a result, 
any PLSTO, i.e., ([5]), can be used in the place of the shr function in the GT operator. Examples of 
GT having PLSTO's as their shr function are shown in Figs. [2^ and [2^. Moreover, GT where the shr 
function is the Bridge £o.5 ^-nd the Smoothly Clipped Absolute Deviation Penalty (SCAD) threshold are 
used and further discussed in the numerical experiments section. 

4 The APGT Algorithm, Its Properties, and a Novel Operator The- 
oretic Framework 

Algorithm 1 (The Adaptive Projection-based Generalized Thresholding (APGT) algorithm). Given 
the user-defined sparsity level K G 1,L — 1, the sequence of non-negative parameters {en)nGNi the 
number g G N* of the hyperslabs to be processed concurrently at every time instant, the function shr 
for the generalized thresholding operation, and an arbitrary initial point, ao G M^, execute the following, 
for every n G N. 



1. Define the sliding window J7n '■= max{0,n — q + l},n on the time axis, of size at most q. The set 
defines all the indices corresponding to the hyperslabs, which are to be processed at the time instant 

n. Among these, identify Xn '■= {i ^ Jn '■ Psi[ti]{(^n) 7^ o-n], which correspond to the active hyperslabs. 
Moreover, for every i G X^, define the weight w^"^ := where denotes the cardinality of Z„, 

in order to weigh uniformly the importance of the information carried by each hyperslab, S'^ei]- Other, 
more general, scenarios regarding the choice of {w^"^ }|^"^' are also possible. 

2. Collect the projections Ps^[ei][(^n), Vi G X„ (see ([3])). 

3. Choose an e' G (0, 1], and let the extrapolation parameter fin take values from the interval [e'A4n, (2— 
e')-^n]j where 



Mr 



7T) 



^_ 1, otherwise. 



Notice that due to the convexity of the function ||-||^, we always have Ain > 1- As such, the parameter 
fj-n takes values larger than or equal to 2. In general, the larger the fin, the larger the convergence speed 
of the proposed algorithm. 



4. Compute the next estimate by 



O-n+l '— < 



^CT f + Atn g (a„) - j , 



(6b) 
ifX„/0, 



TQ'j^{cLn)i if Xn 
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In order to theoretically support the incorporation of GT into parameter estimation schemes, a novel 
family of mappings, called the partially quasi-nonexpansive mappings, which, to the best of our knowl- 
edge, appears for the first time in the related literature [29]. The reasons for defining this new class of 
mappings are: (i) this family includes as a special case the previously defined generalized thresholding 
operator 7q^\ and, thus, it establishes a general theoretical framework for sparsity-promoting map- 
pings, (ii) it introduces sound theoretical tools, which help to attack non-convexly constrained learning 
problems, and (iii) it generalizes the very recent results, obtained for the Adaptive Projected Subgradient 
Method (APSM) [33], to non-convexly constrained online learning tasks (see App. IdI). 

Although the following discussion can be naturally extended to general Hilbert spaces, for the sake 
of simplicity we focus here on the Euclidean space M^, i.e., T : — )■ M^. A concept of fundamental 
importance, associated to every mapping T, is its fixed point set Fix(r) := |a € : T[a) = a} [2811291 . 
In other words, Fix(T) reveals the hidden modes of T, by putting together all those points unaffected 
by T. To leave no place for ambiguity, every Fix(T) that appears in the sequel is assumed nonempty. 

Definition 4 (The class of partially quasi-nonexpansive mappings). A mapping T is called partially 
quasi-nonexpansive, if 

Va; e M^, C Fix(r) : Vy G K,, 
\\T{x) - y\\ < \\x - y\\ . 

The fixed point set Fix(T) is not necessarily a convex set. Let us also define a stronger version of 
([7]); the mapping T will be called strongly or rj-attracting partially quasi-nonexpansive mapping if there 
exists an ?7 > such that 

Va: G M^, 3Y^ c Fix(T) -.^yeY^, 
7] \\x - T{x)f < \\x -yf - \\T{x) - yf . 

An example of such a mapping ([8]) is the novel generalized thresholding mapping of Section [3] (for a 

proof see App.[C]). In App.[Cl we will also verify that Fix(rQ^^) is a union of subspaces, which is indeed a 
non-convex set. Recall that at the heart of any sparsity-promoting learning method lies the search for a 
solution in a union of subspaces [38U39|. It must be pointed out that a number of well-known mappings, 
e.g., [I0l[29lll6], are special cases of the previously defined class of partially quasi-nonexpansive ones. 

The convergence analysis of the APGT is given by the following Thm. [T] This analysis is based 
on a set of deterministic assumptions, given below. Since the APGT is based on the mapping , 
whose fixed point set (see App. [C]) is non-convex, this is the first time that the results of [311 - 133] are 
generalized to non-convexly constrained online learning tasks. 

Assumption 1. 

1. Assume that 3n G N such that 0„ := M (k) n flieXn Si[ei] ^ 0. Let us explain here the physical 
reasoning behind this assumption. Recall, here, that {Si[€i]}i^Xn is the set of all active hyperslabs 
(see Alg. [TJ, at the time instant n. For an appropriate choice of the parameters (e„)„gN (see (H))), 
the hyperslabs contain the desired a* with high probability. Moreover, as time goes by, and due to a 
long sequence of projections in ([6]), the orbit (a.„)„gN is attracted closer and closer to the hyperslabs; 
and as a consequence, closer to a*. For this reason, it is natural to expect that supp(a„) is similar to 
supp(a*), and hence M {k) to M (k), at some time n. Since M (k) enjoys a non-empty intersection 

with Pliex with high probability, we anticipate that the same also happens to M (k). 

2. Assume that there exists a time instant no G N, and an G N*, such that HnL^^"^ 7^ 0- 



8 



3. Assume that := Uminf,i^oo ■= Un>o C\m>n ^™ ^- other words, we assume that the set 
of all points, which belong to all but a finite number of 0„s, is nonempty. 

Theorem 1 (Properties of the APGT). 

1. Let Assumption lllll hold true. Then, (i(a„_|_i, Qn) < (i(a„, Qn), where d{-,Qn) stands for the (metric) 
distance function [29] to r2„. 

2. Let Assumption I1I2I hold true. Then, 

no+N~l no+N-1 

(f(^ano+N, Pi ^n^<d^(ano, f] ^n^ 

n=no n=no 

- max{d^(a„, Sj[ej]) : j G Jn}. 

^ n=no 

In other words, the previous inequality establishes a bound on the distance of the estimates from a finite 
intersection of the 0„s. If we assume, also, that there exists an estimate a„ which does not belong to 
such an intersection, i.e., 3n' S no, no + — 1 such that max|d^(a„', ^^[ej]) : j € c7n'} > 0, then the 
previous result claims that the APGT forces a^g+Tv to be located strictly closer to fXi^^~^ than 

tt^Q IS. 

3. Let Assumption I1I3I holds true. Then, 

(a) the set of all cluster points of the sequence (a„)„gN is nonempty, i.e., ^((a„)„gi^) 7^ 0. 

(b) lim„_^oo d(a„, 5„[e„]) = 0. In other words, as the time advances, the orbit (a„)„gN approaches 

{Sn[en\)neN- 

(c) C((a„)„gp(i) C Fix(T^^^) = Uje,5'(_ft:L) ^J- words, the APGT generates a sequence of estimates 
(a„,)„gN, whose cluster points are sparse vectors, of sparsity level no larger than K. 

Proof. See Appendix [Dj □ 



5 Numerical Experiments 

In this section, our main intention is to provide the proof of concept of the theoretical findings presented 
in Thm. [TJ This is realized via the performance evaluation of ()6bp . where the shrinkage function shr, 
in Def. El assumes any form of ^plstO' defined in ([5]). This study is not meant to be exhaustive, and 
in order to demonstrate the potential of the proposed technique, the hard thresholding (HT) as well 
as the PLSTOs corresponding to the SCAD [19] and the penalty (7 < 1) [13] are examined, since 
they exhibit distinct characteristics, as it is illustrated in Figs.[2]3 and [2^, respectively. Notice that the 
associated penalty functions are non-convex. The resulting thresholding rules are called the SCAD and 
the Bridge Thresholding (BT), respectively. Notice, also, that SCAD is a piece-wise linear thresholding 
operator, whereas, the BT exhibits strong discontinuity and non-linearity. 

In order to comply with the theory, the SCAD, the BT, and the HT are used as shrinkage functions 
shr in Def. [3l for all the components Xi with index i ^ where K stands for an estimate of the true 

:= ||a*||o- To this end, we have slightly modified the classical SCAD, BT, and HT rules in order to 
fit our need to keep a number of K components of a vector intact. As such, the SCAD thresholding 
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operates according to the following rule; given the input x G MJ" and the output vector z := rscAD(a;)i 
the i-th coordinate of z, where i ^ is given by the next rule: 



0, if \xi\ < A, 

sgn{xi){\xi\- \- 5)+, if G (A,2A], 

if \xi\ G (2A, min{^4^\aA}], 



where A is the regularization parameter, which appears in the definition of the PLSTO in ([5]), a is a user- 
defined parameter, inherent to SCAD |19j . 5 > is a sufficiently small user-defined parameter motivated 
by Definition 13161 and (•)+ := max{0, •}, introduced here in order to leave no place for ambiguities. Our 
modification on the classical SCAD can be seen by the introduction of 5, Cx^\ a-iid (•)+. 



Similarly, given the classical version of the BT rule [18], our modified BT is given as follows by 
involving the quantity in the computations: Vi ^ Jx^\ 

'sgn{xi){zi - (5)+, 

if min{cBT(A,7),eif ^} < \xi\ < \ (10) 
0, otherwise, 

where A is the corresponding regularization parameter in ([5]), 7 G (0, 1) is a user-defined parameter, and 

1 7-1 
1 \ 7-2 / 1 \ 7-2 



cbt(A,7) := -T— ^ -t] +A7 



-^7(7 -1)7 V ^7(7-1), 

The term Zi stands for the solution of the equation Zi + sgn{zi)X-fz]''^ = \xi\. When 7 is set equal to 
0.5, Zi is obtained in closed form by solving a third order polynomial equation. Similarly, HT is given 
by the following rule; Vi ^ Jx^\ 

_ Jo, if \xi\ < min{A,^i^^}, 

I sgn(xi)(|xi| — (5)+, otherwise, 

where the A is introduced here in order to be compliant also to a definition of the HT used often in the 
literature (see the discussion in Appendix [B]). 

In the following experiments, unless otherwise stated, the signal under consideration has L = 1024 
and Kj, = 100. Moreover, the classical CS signal recovery problem is considered, where the input 
(sensing) vectors have independent components drawn from a normal distribution 7\A(0, 1), and the 
observations are corrupted by additive white Gaussian noise of variance = 0.1. Regarding APGT, 
the extrapolation parameter fi^ is set equal to J^n, and the hyperslab parameter := l.3a, \/n. In 
this paper, for all the techniques employed, configurations leading to the fastest convergence rate are of 
principal interest. Prom this perspective, unless otherwise stated, q is fixed to 390 since this appeared 
to be the lowest q value leading to enhanced convergence speed for the specific L and K values. It 
should be stressed out that the method is not sensitive to the parameter q. A larger q value would only 
add to computational complexity without any significant contribution to performance. An extensive 
and complementary experimental study of the APGT performance, in the case where q is confined to 
small values, which relates to very low computational complexity techniques, can be found in |471l48j . 
In all of the succeeding figures, the MSE stands for MSE„ := Yl'i=i W^* ~ (^n{i)\\^ ■, where {an{i))n&i 
is the sequence generated by the i-th realization of Alg. [H and r := 100 is the number of independent 
realizations in order to smooth out the obtained performance curves. 
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5.1 Employing time-invariant thresholding operators 



By the modifier "time-invariant", we mean that the user-defined parameter A in ([5]) remains fixed 
for all the time instants n G N. The performance of ah the employed methods is given in Fig. [3al 
In all cases, K := K^. The regularization parameter A was optimized leading to the values shown 
in the corresponding figure legend. Moreover, APGT-SCAD, without being considerably sensitive to 
parameter a, appeared to perform best when adopting the relatively large value a = 12. 




500 1000 1500 2000 

Iterations 



(a) Time-invariant thresholding, i.e., fixed A. 




Iterations 



(b) Time-adaptive thresholding, i.e., time- varying A. 



Figure 3: (a) Performance study of APGT using thresholding operators which are fixed in each iteration 
and comparison with IPAPA algorithm, (b) Performance study of APGT using thresholding operators 
which are changing in each iteration and comparison with LASSO solution. 

For comparison, the Improved Proportionate Adaptive Projection Algorithm (IPAPA), described in 
|35y36j. is employed. The projection order of the IPAPA, which plays a similar role to g, and therefore 
the same notation is used, is the major factor which dictates its performance. Dashed curves indicated 
with triangles, stars and squares correspond to values of q equal to 50, 100, and 200, respectively. The 
step parameter of the IPAPA is denoted by /i. The best IPAPA performance, i.e., the one depicted with 
a dashed curve with diamonds, is achieved with q = 200 and ^ = 1.8. For lower q values, such a large 
led to unstable performance. In all cases, the parameter /3, which tunes the weights in the proportionate 
algorithrG0, was given the large value (3 = 0.9 in order to exhibit enhanced sparsity promoting behavior. 
When larger q values are used, e.g., q = 400, the performance turned to become somewhat faster, but 
with a quite elevated steady-state error floor, so the corresponding performance curves are not shown. 
Moreover, a set-membership counterpart of IPAPA |34j was also examined. This algorithm performed 
similarly to IPAPA, so the results are not shown to ease visualization. It is clear that the APGT-^o.5 
performs as well as IPAPA. However, this is achieved under a significantly lower computational burden, 
as will be discussed in Section 15.51 



5.2 Employing time-adaptive thresholding operators 

In the previous section, the exact shape of the thresholding function was determined in advance using 
fixed values for the associated parameters, e.g.. A, 7, q, etc. This is quite limiting, since the proposed 

^See parameter a in (2) of [36]. We call it here /? in order to avoid confusion with the parameter a of SCAD. 
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technique has the potential to incorporate time-adaptive a-priori information, in the form of time- varying 
thresholding operators. This section demonstrates that exploiting this freedom leads APGT to enhanced 
performance. In particular, A in ([5]) changes as time n advances. In order to explicitly describe this 
dependency of A to n, we will use hereafter the notation A„. Assuming that an estimate K of the true 
sparsity level K^, is available at each iteration n, parameter A.„ is properly tuned in order to guarantee 
that after thresholding, a fixed number of components will be set equal to zero. With respect to the HT 
operator, in order to achieve a sparsity level equal to K, i.e., L — K components are zero, the quantity 
An should be set equal to Ca^\ V"-- For the SCAD case, A„ := ^Ca^\ ^n, (refer to Q). In this way, the 
SCAD shrinkage behavior is preserved and tuned by the user-defined parameter a. In a similar manner, 
an adaptive BT can be built. Going even further, apart from the K larger in magnitude components 
which remain unaltered, the next, say P, smaller in magnitude components could be shrunk according 
to the bridge rule. This is achieved if we notice that, by definition, ^i^"*"^^ < VP G 1,L — K, and 
that the parameter A„, is defined here as the solution of the following equation ^i^^^^ = CBT(An,7)- In 
particular, for 7 = 0.5, this solution obtains a closed form: 

Vn. (11) 
For convenience, the full GT operator involving the ^0.5 shrinkage is given next: Vi ^ Jx^\ 

_/o, if|..i<ei!:^^\ 

[sgn{Xi){Zi - 5)+, if Ca„ ' <\Xi\<Ca„ , 

where Zi satisfies Zi + tt^^A^ sgn(zj) = IxA, and A^ is given by (fTTI) . 

The performance of APGT methods, using the previous time-adaptive thresholding strategy, here- 
after abbreviated as APGT- AT, is shown in Fig. I3b[ For reference, the dotted curve marked with open 
circles is the one from Fig. [3a] corresponding to the best APGT method with a fixed A. Moreover, 
the best results for the APGT-AT-£o.5 are obtained when P assumes a small integer value, such as 
10. A conclusion that can be easily drawn is that the incorporation of adaptive thresholding led to a 
performance boost. Moreover, the performance achieved depends on the thresholding operator that is 
adopted, with the BT leading to somewhat faster convergence speed compared to SCAD and HT. The 
performance of APWLl, proposed in [llj . is also shown with solid line marked with triangles. It ap- 
pears that the newly proposed algorithms, and especially APGT-AT-^q.Sj succeeds in achieving a similar 
convergence behavior and speed compared to APWLl and, as it will be discussed in Section [5.51 with 
half the computational complexity. For completeness, the Online Cyclic Coordinate Descent - Time 
Weighted Lasso (OCCD-TWL), presented in [7], is depicted with solid line marked with squares. The 
latter is an online algorithm approximating the LASSO problem solution. It is observed, that APGT 
{q = 390), demonstrates a performance competitive to OCCD-TWL, which is an 0{Lp') complexity 
algorithm. 

The advantages of the APGT algorithm over the APWLl are not limited to the performance im- 
provements and/or to computational complexity savings. The proposed theoretical framework is general 
enough in order to include other thresholding operators as well, either existing or newly defined. How- 
ever, the scope of this paper is not a simulation study of all these alternatives of thresholding, and such 
a route will be studied elsewhere. For example, in [U], implementations of the proposed scheme driven 
by a different set of PLSTOs, suitable for low complexity operation, and a novel specially customized 
thresholding operator are presented. In that case, comparison with linear complexity sparsity inducing 
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algorithms, such as the Reweighted Zero Attracting-Least Mean Square (RZA-LMS) [6], ^q-LMS [12], 
and the Sparse Adaptive Orthogonal Matching Pursuit (SpAdOMP) [9] is made in more advanced sce- 
narios, such as system identification with correlated input signal (see [U]) and sparse signal estimation 
corrupted by non-symmetric and/or impulsive noise. 

5.3 Robustness against inaccurate sparsity level estimates 

With the aid of Fig. Hal the effect of over- and under-estimation of is discussed for the reduced 
complexity case of g = 20. We choose a low value for q, since we noticed that such a scenario reveals 
more distinctly the performance sensitivity and related behavior of the APGT with over- or under- 
estimations of K^. Moreover, the use of a low value of q, reveals the performance advantages of the GT, 
compared to other linear complexity algorithms, such as the io-hMS [12]. As it is seen from the Fig. Hal 
the use of the GT mapping results in enhanced performance w.r.t. both APWLl and io-hMS, where 
the latter was fine-tuned for best convergence speed/error floor trade off. In order to have a reference 
of the performance achieved when the true sparsity level is given, the APGT-AT-^o.5 with K = 100, 
is also provided in Fig. Hal Let us start with the under-estimation case and assume that K = 80, i.e., 
20% lower compared to the true sparsity level. Let us take, for example, the APGT-AT-SCAD curve, 
which shows an elevated error floor. Notice that the case of under-estimations of is not supported 
theoretically by Thm. [H With respect to over-estimation, APGT is shown to be very robust. For 
example, let us see the case where is over-estimated by 100%, i.e., K := 2K^. The performance 
achieved by APGT-AT-^o.5 (sohd line with open circles) is still much better compared to the APWLl, 
even if APWLl uses an accurate estimate for the K^,. Moreover, the degradation resulted from such 
a large over-estimation appears to be limited. Remarkably, in this low q case, both APGT-AT-HT 
and APGT-AT-SCAD, drawn with solid lines marked with x-crosses and diamonds, respectively, have 
benefited from the over-estimation. The reason for this is that when q is small, the tentative estimates 
of the unknown vector in each iteration are likely to be not accurate enough in order for the larger of 
them to reveal the true support of the vector. An over-estimated X* leads to less strict HT and SCAD 
thresholding operators, which allow components that would otherwise be set equal to zero, to survive. 
All the results above have been confirmed with higher levels of over-estimation. 

The results are similar when the algorithms operate with higher complexity, i.e., q = 390, with the 
difference that the performance of APGT-AT-HT and APGT-AT-SCAD does not benefited as much as 
previously by an over-estimation of K. The APGT-AT-SCAD and APGT-AT-^o.5 perform similarly, 
so the corresponding curves are not shown. A thorough examination of several scenarios, in the case 
where q attains low values, is deferred to a future work. 

5.4 Tracking ability of the APGT 

Fig. I4bl shows the ability of the tested algorithms to track an abrupt change of the unknown vector a*, 
which is realized here after 1500 observations is examined. This is a typical setting used in adaptive 
filtering [41] community to study the tracking agility of an algorithm. Here, in order to give an essence 
from the CS paradigm, we consider the vector to be not sparse itself but to have a sparse wavelet 
representation. In the first half, the signal under consideration is of length L = 1024, with K^: = 100 non 
zero wavelet coefficients. However, at the 1500 time instant, ten randomly selected wavelet coefficients 
change their values from to a randomly selected nonzero one. Since the sparsity level of the signal 
changes (from 100 to 110, at most) and it is not possible to know A'* exactly in advance, taking 
into account that the methods we propose are quite robust to i^* over-estimations, we set K = 150 
throughout the whole experiment. Moreover, q is set to 390. 
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Iterations Iterations 

(a) Robustness against erroneous estimates of the spar- (b) Robustness against time variations of the desired 
sity level. solution. 

Figure 4: (a) Robustness of APGT-AT in the cases of an under-estimation and an over-estimation 
of Kit = 100, i.e., X = 80 and K = 200, respectively. The g = 20 in these experiments, (b) The 
unknown vector has a sparse wavelet representation which changes abruptly after the reception of 1500 
observations. 



For the OCCD-TWL, an RLS-like forgetting factor lower than 1 is adopted, in order to succeed in 
re-estimating the unknown signal after the abrupt change. More specifically, the value of 0.996 appeared 
to offer a good trade-off between convergence speed and steady-state error floor. However, the OCCD- 
TWL convergence speed slows down after the 1500 time instant, something which was observed and 
discussed in [11] as well. The IPAPA method, catches up quickly after the abrupt change; however, the 
attained error floor is higher than that of the APGT. 

5.5 Computational complexity 

The choice of the thresholding operator affects significantly the overall computational burden for two 
reasons. First, the thresholding function itself requires a larger or smaller number of mathematical 
operations depending on the specific thresholding rule. Such operations can be multiplications, divisions, 
as well as sorting operations. Additions are ignored since they are considered to be much less costly. 
A second attribute of the thresholding rule, which affects complexity, is whether its outcome is a 
sparse vector with a certain sparsity level or not. Indeed, if the thresholding operator produces vectors 
which are, say, i^-sparse, then projections in APGT involve inner products with sparse vectors where 
the number of required multiplications equal to K instead of L. The HT and the GT with Bridge- 
^0.5 shrinkage function, as they where presented in 15. 2^ belong to this category with K = K and 
K = K + respectively. The SCAD thresholding rule does not guarantee a fixed number of zeros 
after its application. This is also the case of the APWLl |llj . Moreover, in the case of the APWLl, 
exact projections onto the weighted £i-ball need to be computed, and in order to do so, the sorting of 
a vector is necessary, which requires in general C'(Llog2 L) operations. However, by adopting a divide- 
and-conquer approach, as in [49], one might reduce the above computational complexity down to 0{L) 
operations. 

The worst-case computational complexities of all the methods employed are given in Tabled) The 
parameter ei is either 1 or 2, depending on whether all uj^^^ of the APGT are given the same value or 
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IVIxiltipliccLtions 


Divisions 


Sortings 


Powers 


APGT-AT-HT 

APGT-AT-^0.5 

APGT-AT-SCAD 

APWLl 

OCCD-TWL 

IPAPA 


(gei +€2 + l)L + {K + ei + l)q 
(gei +e2 + l)L + [K + P + ei + l)q + 12P 1 
(gei 62 + 1)L + {L + ei + l)q + {L - K) 
{qei + e2 + 1)L + {L + a + l)q + SL 
+ 3L 

0{q^) + {q^ +3q + l)L + q 


62 + 1 

P + e2 + 2 

L^K + 62 + 1 
21 + 62 + 1 

L 


0{L) 
0{L) 
0{L) 
0{L) 


3P+ 1 



Table 1: Computational complexities of all the methods employed. 



not. In the examples of this paper the former is the case, i.e., ei = 1. Moreover, parameter 62 is either 
1, if the £2 norm of the input vectors (Wn)neN is arbitrary, or 0, if it is normalized to unity. 



6 Conclusions 

The present paper contributed to sparsity-aware online learning tasks in the following three ways: (i) it 
established a Generalized Thresholding (GT) mapping, which can incorporate as a shrinkage function the 
majority of the thresholding rules found in the literature, (iii) it proposed a non-convexly constrained, 
online learning algorithm for sparse signal recovery tasks with a computational complexity which scales 
linearly to the number of unknowns, and (iii) it introduced a family of mappings which serves as 
the wide functional analytic stage for the study of the previous GT operator. Rigorous discussions 
on the properties of all the previous functional analytic tools, as well as a convergence analysis of 
the proposed algorithm were provided. To validate the theoretical findings regarding our algorithm, 
extensive experiments were conducted, which showed that the proposed methodology offers a sound 
theoretical, and very competitive time-adaptive technique, with lower computational complexity than 
several of the state-of-the-art, sparsity-promoting, online learning algorithms. 



A Convex Sets, Convex Functions, and Projection Mappings 

A subset C of will be called convex, if for any a, a' G C, the line segment {Xa -I- (1 — X)a' : 
X € [0,1]} lies in C. A function 6 : R-^ ^ M is called convex if Va,a' € R^, and VA G [0,1], we 
have &[Xa -|- (1 — A)a') < A0(a) -|- (1 — A)0(a'). The 0-th level set of the convex is defined as 
lev<o(e) := {a G : G(a) < O}. A subgradient of the convex function at a point a, denoted 
as @'{a), is an L-dimensional vector such that {v — a)~^Q'{a) + @{a) < @{v), \/v G M^. In general, 
the number of the subgradients of at a is infinite. The set of all subgradients of at a point a is 
called subdifferential, and it is denoted by 90(a). In the case where is differentiable at a, then the 
subgradient 0'(a) is unique, and it is nothing but the gradient of at a. 

Given a closed convex C C M^, define the (metric) distance function d{-, C) : ^ M to C as follows: 
Va G M^, d(a,C) := inf{||a-t;|| : f G C}. Notice that d{-,C) is convex with lev<o (i(-,C) = C. The 
(metric) projection onto C is defined as the mapping Pc -.R^ ^ C, which maps an a G to the 
unique Pc{o-) £ C, such that 110 — ^(7(0)11 = d{a,C). For example, the subdifferential of d{-,C) is 
given as follows: 

ddia,C) = ^^^ (13) 



15 



where Nc{a) := {v eR^ : {y -a) <0,\/y e C}. 



B The Penalized Least- Squares Task 

Going back to ([1]), choose N S N*, and define Un ■= [un,Un-i, ■ ■ ■ ,Un-N+i] £ M^^^, as well as 
Vn ■= [2/n,yn-i, • • .,yn-N+iV ^ ^nd Vn ■= [vn,Vn-i, ■ ■ .,Vn-N+iV ^ ■ Then, it Can be easily 
verified that ([1]) takes the form of yn = + Vn, Vn G N. The mainstream of the batch sparsity- 

promoting algorithms utilize all the gathered N training data to find an exact or approximate solution, 
in most cases iteratively, to the following penalized least-squares minimization task, 



1 

mm — 



yn - J7„ a 



+ X^pi\ai\), (14) 



where p : M ^> [0, oo) stands for a sparsity-promoting and non-convex, in general, penalty function, 
A S (0, oo) is the regularization parameter, and Oj stands for the i-th coordinate of the vector a. 

Choices for p are numerous; if, for example, p{\a\) := XR\{o}(|f^l)i G M, where stands for the 
characteristic function with respect to jz/ C M, i.e., '■= 1, if a S and x^(ck) •= 0, if a ^ £/, 

then the regularization term J2i=i Pi\^i\) becomes the ^o-norm of a. In the case where p{\a\) := \a\, 
Va S M, then the regularization term is nothing but the ^i-norm ||a||-|^ := \cLi\, and the task 



(jl4p becomes the celebrated LASSO [M]. However, it has been observed that if some of the LASSO's 
regularity conditions are violated, then LASSO is sub-optimal for model selection [T3l[20t l22 | [241127] . 
Such a behavior has motivated the search for non- convex penalty functions p, which bridge the gap 
between the Iq- and ^i-norm; for example, the penalty, for 7 € (0, 1), [13], the log [18], the SCAD 
[lailg], the MC+ [MIET], and the transformed h [IS] penalties. 

Recently, sparsity-promoting coordinate-wise optimization techniques for solving the task (jl4p are 
attracting a lot of interest [7l l26l[27] . To be more concrete, assume, for example, that N = L, and that 
the matrix Un is orthogonal. Byy defining a„ := Unyn, (dH) can be equivalently viewed as the following 
separable optimization task [T8|[2T]. 

a&R^ ^ 2A 
1=1 

Under some mild regularity conditions on p |18j . the minimization task of (jlSp possesses a unique 
minimizer. Due to the separability of (jlSp in coordinates, the minimization task of (jlSp can be viewed 
as a task defined on an 1-dimensional axis, instead of an L-dimensional domain. Accordingly, the 
problem reduces to the univariate PLS task described in (jH). 

Figs. [2]^b-d), show the thresholding functions (PLSTO, see ([5|)), which solve ([4]) for some of the 
most commonly employed penalty functions. For example, if p(|a|) := [A^ — (|a| — A)^X[o,A)(l'2|)] /-^j 
Va G M, then the resulting PLSTO is the celebrated Hard Thresholding (HT) mapping [18], which is 
depicted in Fig. [2^ together with the well-known Soft Thresholding (ST) mapping which results in the 
case where p(|a|) := |a|, i.e. is chosen such that to lead to the LASSO task. Note that both ST and 
HT operators have been effectively employed in iterative thresholding schemes for fast sparse signal 
recovery under the compressed sensing framework [3H5ll50j. The rest of the thresholding rules, shown 
in Fig. [2b correspond to the MC+ penalty [2ll[27] and the SCAD [H], respectively. Both SCAD and 
MC+ leave large components unchanged, like HT, while avoiding being discontinuous and at the same 
time allowing a linear/gradual transition between the "kill" and the "keep" areas of HT. HT is far from 
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being the only discontinuous thresholding operator. An example is shown in Figs. [2]3, by the widely 
known Bridge threshold |13j . which is related to the penalty, 7 < 1 [51]. Note that this thresholding 
rule comprise nonlinear segments. Continuous thresholding functions, that contain nonlinear parts, are 
shown in Fig. [2jd). More specifically, the non- negative garrote [16] and representatives of the n-degree 
garrote threshold are shown. Similar thresholding functions are also the hyperbolic shrinkage rule [17] 
and PLSTO's stemming from the nonlinear diffusive filtering approach |21j . 



C Properties of the GT Mapping 

Theorem 2. 



1. V..eM^/fi. =4^). 

GT ^'^^ 

2. Fix(TQ^^) = Uj65^{i^ L) ^J- Notice, here, that Fix(TQ^^), as a union of subspaces, is non-convex. 

3. Let a sequence {Xn)ne'N C M-^ and an € M^. If lim„__s.oo Xn = x^, and lim„_s.oo (-^ — ^gt^) ~ 
then a;* G Fix(TQ^^). This property can be rephrased as / — 7q^^ being demiclosed at 



4. Tq^^ is 1-attracting partially quasi-nonexpansive, i.e., \/x € M^, \/y € Mj(k), x — Tq^\x) 



2 



< 



- y\\ 

Proof: 

1. Define z := Tq^\x). In order to derive a contradiction, assume that //^ + Jf\ Since both 

Jx^\jz^^ have the same cardinality, the previous assumption means that there exist /qi^o such that 

^0 e J^'^ \ and Iq € Jz^'' \ Hence, jx^/^l = |shr(x,^)| = > \zi^\ = jx^J > |x,jj. The 

previous result implies that \xi^^ \ = which, in turn, suggests by the definition of that /q < ^o- 

Moreover, \ziij\ = \ziq\ and I'q < Iq by the definition of Thus, Iq < I'q < Iq, which is absurd. This 

contradiction establishes the claim of Thm. I2I1[ 

2. Pick any x G Uje5^(KL) ^J- ^^^y *° verify by Def. [3] that T^\x) = x, i.e., x G Fix(TQ^''). 
To prove the opposite inclusion, assume any x G Fix(TQ^^), i.e., Tq^\x) = x. Since V/ G the 
relation T^^{x) = x leads to the trivial result xi = xi, we deal here only with the more interesting 

case of / ^ Jx^^ ■ For such an /, according to Def. [3l we must have shr(x;) = xi, which implies that 
I shr(x;)| = |rE;|. However, by the properties of shr, given in Defs . [3l5] and [3l6t we necessarily obtain that 
XI = 0. Since this holds V/ ^ Def. [2] suggests that x G M (x). Now, recall that J^^^ G ^{K,L) 

to establish the inclusion x G Ujej^(XL) 

3. (a) Assume, for a contradiction, that there exists an e > and a subsequence (rafc)fcgN, such that 

By Def. I3I6| 36 > such that |shr(x„j,^;^^)| < |a;nfc,«„j.| — 6, \/k. Then, it is easy to verify that VA:, 
W^Uk - shr(x„,,,/„j| > \xna-a,^ I - |shr(2;„^,i„j| > | - \+6 = 5. This implies that VA;, 



{Xn^,l - shv{Xn^^l)Y > {xn^,!^^ " shr(x„^,;„ j) ^ > 6^. (16) 
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Notice that 

' GT 



[l — ^cT^) i^n) = J2id j'-^') {^n,i — shr(x„^i))^. Hence, the assumption that liuin^ooil — 
Tp,^,^) (xn) = imphes that for the 5 of (fTBIl . 3no E N such that Vn > no, T],^ ak) (xn ; — shr(x„ /))^ < 5^. 
This contradicts (jl6p . In other words, our initial claim is wrong, and the contrary proposition becomes: 
Ve > 0, there exists an ng € N such that \xj^^i\ < e, V/ ^ Jx^^ \ ^ ^o- This can be equivalently written 
in a more compact form as follows: 



lim max{|x„;| : / ^ J^^^] = 0. 



(17) 



(b) Let us define here 



lim inf Jl 



u n^. 

n=0 m=n 



(^) 



(18) 



In words, J^o contains all those points which belong to all but a finite number of J^\. There are two 



cases regarding J^f ■* and Jqo ; either J^" ' n Jqo / or Jxl ' H J( 
the case where Jqo = 0- Let us examine each case separately. 

i. The case of PI Jqo 0- 



Notice that the latter covers also 



A. Assume that J^f"^ C Jqo- This implies that there exists an no such that Jx^^ C nn,>no "^a;^^- Since 



both Jx^ and Ji^ have the same cardinality, i.e., K, we obtain that Vn > no, Jx, 



. Choose 



any / ^ j'^'' = J^^\ By ^3), lim^^oo 2;„,/ 

B. Assume now that Jx^^ ^ Jqo- Hence, there exists an / E Jx^^ and a subsequence {nk)keN such that 

/ = = x^: I. Since / G Jxf\ we clearly have that x^, = 0, 



= x^,^. Thus, V/ ^ Jx^ , x^,^/ = 0, or equivalently. 



I i VA: G N. By (HID, lim„, 

V/' i- Jif ^ Hence, a;* G Uje5^{i^,L) ^^-^ 



ii. The case of Jif ^ n Joo = 0. This means that there exists an / G Jx, and a subsequence {nk)h&{ 
such that / ^ '^^k^ ^ ^' similarly to our previous arguments, a;* G \}j<^:y(^KL) ^^J- 



AK) 



4. Define Rk ■= 2T^^ - I. Given any x G M^, let z := T^'{x), as in Def. H Then, verify that 



(K), 



\\Rk{x) 



y\ 



Y{2zi -xi- yif 
1=1 

{xi-yif+ ^ (2shr(x/) - Xi) 



(K) 



14. J, 



The previous inequality is obtained from the observation that the properties of shr in Def. [3] suggest 
shr^(x/) < xi shr(x;), and from the following elementary calculations: (2 shr(x;) — x;)^ = 4shr^(x;)+x^ — 
4x; shr(xz) < 4shr^(xi) + xf - 4shr^(x/). Hence, Va; G M^, Vy G M ^k), \\Rk{x) - y|p < \\x - 



2T, 



(K), 



GT 



x) — X — y 



< \\x - y\ 



2(Tg\x)-y)-{x-y) 



< \\x - y\ 



< 
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T^{x) —y , where in order to obtain the last equivalence we used some elementary 



If - y\ 

algebra, and the fact 
Thm.[2 



\x-y\\ + 



X 



p(K) 

'gt 



{x) 



. This establishes the claim of 



D Proof of Theorem [T] 



Let us define first a sequence of convex functions (0n)neN in an inductive way. Given the time index 
n, and the estimate an G K^, define the following convex function; Va € M'^, 

uj["^d{an,Silei\) 



QJa) := < 



d{a, Si[ei]) 



0, 



ifX„/0, 
otherwise, 



where L„ := J^^gj^ wj"'^(i(a„, Sjlej]). It is easy to verify by the definition of X„, that if X„ ^ 0, then 
Vz G X„, (i(a„, 5j[ej]) > 0, and thus L„ > 0. Moreover, if Z„ = 0, then ©'„(a) = 0, Va. 

Let us look closer to and especially only the interesting case of In ^- By standard subgradient 
calculus, it can be verified by ([T3]) that 



(a„ -Ps^[,^](a„)). 



Thus, whenever In 7^ 0, we have 6'„(a„,) = iff X^jgi^ '^j"^ — Ps^[e^]{o,n)) = 0. Hence, for some 
user-defined parameter A„ > 0, it is straightforward to see that 



If we let An := fin/-^n, then an examination of ([6]), for both the cases of 7^ and In = 0, implies 
that the proposed algorithm can be rephrased as follows; for := /J-n/Mn S W , 2 — e']. 



4? (an - A„ J#^e'„(a„)) , e;(a„) / 0, 

e;(a„) = 0. 



GT 
GT I""/'' 



1|e;{a„)| 



(19) 



1. Fix any v G r2„, and assume that 0^(a.„) 7^ 0. Then, since v G M (k) C Fix(rQ^^) 



|an+i - v\ 



GT 



Qn(Qn) 

G' a, '"^ 



0'n(an) 



< 



(a„ - ti) - A^ 



'n\"'nj 



< \\an - v\ 



A„(2 - A 
ie') 



2 0n(«n) 



|0n(an) 



/n2 0n('^n) 



|2 ' 



(20) 
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where we used the property of Thm. 12141 and the definition of the subgradient 0^(a„). As a result, 
||a„_i_i — v\\ < ||a„ — v\\. Notice, that this holds true also for the case where 0^(a„) = 0. Now, if we 
apply inf„gf7^ on both sides of the previous inequality, then we establish the claim of Thm. Illll 



2. Fix n G no,no + — 1. Assume that 0^(a„) ^ 0. Then, notice by the convexity of the function 



P that 



> - ^ d^{an,Si[ei]) 



2 



> ^max{(f {an, Sj[ej]) : j G J^}. (21) 



Hence, by pTjl . 



|a„+i - ijIP < ||a.„ - flp 



./\2 



max{d\a^,S.j[ej]) :j e Jn}. (22) 



Notice also that (j22p holds true also for the case where G^(a„) = 0. If we take the infimum over 

l7i=rio 



all V G rini^^ ^ on both sides of ([22]). and if we add the resulting inequality for all values of 



n G no, no + A^ — 1, then the claim of Thm. [112] is established. 

3. (a) Choose arbitrarily any v Q. Then, by definition, there exists an no such that v G C\n>no 
Clearly, (j22p leads to ||a„+i — ?;|| < ||a„ — Vn > no, i.e., {\\an — v\\)n>no is monotonically non- 
increasing, and thus convergent. This result implies also that the sequence (a„)„gpj is bounded, and 
that e:((a„)„6N) /0 [29]. 

(b) Let V G and no as previously. We have already seen that \/v G 0, (||an — i»|p)„gN is convergent. 
Thus, it is Cauchy, and 

lim (||a„ - v\\'^ - ||a„+i - ^;||^) = 0. (23) 



Now, a simple inspection of (j22p and (|23p establish the claim of Thm. Ill3bl 

(c) Notice that Vn > no, 0'„(a„) = iff G lev<o(0n) = fliex^ Si[ei] ^ 0. Hence, for such n, (fT9]) 
takes the following equivalent form: 

a„+i=Tg)r4";)(a„), (24) 

where the mapping Tq^ is the subgradient projection mapping with respect to the convex 0„ defined as 
[29]: rg;V) := a - A„^^e;,(a), if a ^ lev<o(e„), and rg;V) := a, if a G lev<o(G„). The 

equations (pO|) and ([23]) imply that lim.„_^oo n^"/"" L = 0. It is a matter of simple algebra to show also 



that 



^^WJ^ - ^l|e;(a„)||- such, 



lim_(/-r£"))(a„) = 0. (25) 



n— >oo 
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A remarkable property of the subgradient projection mapping is the following [29]: Va € M , Vi; € 

lev<o(e„) / 0, 



2 — \n 

An 



a-T£")(a) 



By (IMI and Thm. EUl we can see that ji^^^ = J^^^^, . Now, notice by Thm. EH and ([26]) that 



< a — u 



riK) 



(26) 



< 
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a„+i - vf 



2 — Xn 
An 



an-r4;"^(a„) 

2 



\an+i-v\\ <\\an-v\\ -\\an+i-v\ 



(27) 



Thus, by (j23]l . we obtain 



linr(l-r^))Tg^")(an)=0. 



(28) 



By Thm. Ill3a| choose any a* G C((a„)„gN). Thus, there exists a subsequence (anj.)^^^ such that 
limfc_>.oo cinfe = o^,. Hence, by (p5|) . lim/c_j.oo T'q'^"'' ^ (ttn^. ) = a*. This result, (p8]) . and Thm. [2l3] lead to 
a, G Fix(r^^)). S ince a,,, was chosen arbitrarily, we obtain the desired C((a„)n6N) C FIx^Tq^^). 
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